Memristor based multithreading

ABSTRACT

A method and a device that includes a set of multiple pipeline stages, wherein the set of multiple pipeline stages is arranged to execute a first thread of instructions; multiple memristor based registers that are arranged to store a state of another thread of instructions that differs from the first thread of instructions; and a control circuit that is arranged to control a thread switch between the first thread of instructions and the other thread of instructions by controlling a storage of a state of the first thread of instructions at the multiple memristor based registers and by controlling a provision of the state of the other thread of instructions by the set of multiple pipeline stages; wherein the set of multiple pipeline stages is arranged to execute the other thread of instructions upon a reception of the state of the other thread of instructions.

BACKGROUND

The following references illustrate the state of the art:

-   [1] R. Gabor, S. Weiss, and A. Mendelson, “Fairness Enforcement in Switch On Event Multithreading,” ACM Transactions on Architecture and Code Optimization, Vol. 4, No. 3, Article 15, pp. 1-34, September 2007.
-   [2] J. M. Borkenhagen, R. J. Eickemeyer, R. N. Kalla, and S. R. Kunkel, “A Multithreaded PowerPC Processor for Commercial Servers,” IBM Journal of Research and Development, Vol. 44, No. 6, pp. 885-898, November 2000.
-   [3] C. McNairy and R. Bhatia, “Montecito - The Next Product in the Itanium Processor Family,” Hot Chips 16, August 2004.
-   [4] B. J. Smith, “Architecture and Applications of the HEP Multiprocessor Computer System,” Proceedings of SPIE Real Time Signal Processing IV, pp. 241-248, 1981.
-   [5] L. Gwennap, “Sandy Bridge Spans Generations,” Microprocessor Report (www.MPRonline.com), September 2010.
-   [6] R. Waser and M. Aono, “Nanoionics-based Resistive Switching Memories,” Nature Materials, Vol. 6, pp. 833-840, November 2007.
-   [7] Y. Huai, “Spin-Transfer Torque MRAM (STT-MRAM) Challenges and Prospects,” AAPPS Bulletin, Vol. 18, No. 6, pp. 33-40, December 2008.
-   [8] L. O. Chua, “Memristor - the Missing Circuit Element,” IEEE Transactions on Circuit Theory, Vol. 18, No. 5, pp. 507-519, September 1971.
-   [9] R. Waser, R. Dittmann, G. Staikov, and K. Szot, “Redox-Based Resistive Switching Memories - Nanoionic Mechanisms, Prospects, and Challenges,” Advanced Materials, Vol. 21, No. 25-26, pp. 2632-2663, July 2009.
-   [10] B. C. Lee, E. Ipek, O. Mutlu, and D. Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 2-13, June 2009.
-   [11] M. N. Kozicki and W. C. West, “Programmable Metallization Cell Structure and Method of Making Same,” U.S. Pat. No. 5,761,115, June 1998.
-   [12] J. F. Scott and C. A. Paz de Araujo, “Ferroelectric Memories,” Science, Vol. 246, No. 4936, pp. 1400-1405, December 1989.
-   [13] Z. Diao et al., “Spin-Transfer Torque Switching in Magnetic Tunnel Junctions and Spin-Transfer Torque Random Access Memory,” Journal of Physics: Condensed Matter, Vol. 19, No. 16, pp. 1-13, 165209, April 2007.
-   [14] International Technology Roadmap for Semiconductors (ITRS), 2009.
-   [15] A. C. Torrezan, J. P. Strachan, G. Medeiros-Ribeiro, and R. S. Williams, “Sub-Nanosecond Switching of a Tantalum Oxide Memristor,” Nanotechnology, Vol. 22, No. 48, pp. 1-7, December 2011.
-   [16] J. Nickel, “Memristor Materials Engineering: From Flash Replacement Towards a Universal Memory,” Proceedings of the IEEE International Electron Devices Meeting, December 2011.
-   [17] Z. Guz, E. Bolotin, I. Keidar, A. Kolodny, A. Mendelson, and U. C. Weiser, “Many-Core vs. Many-Thread Machines: Stay Away From the Valley,” Computer Architecture Letters, Vol. 8, No. 1, pp. 25-28, May 2009.
-   [18] D. M. Tullsen, S. J. Eggers, and H. M. Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 392-403, June 1995.
-   [19] J. W. Haskins, K. R. Hirst, and K. Skadron, “Inexpensive Throughput Enhancement in Small-Scale Embedded Microprocessors with Block Multithreading: Extensions, Characterization, and Tradeoffs,” Proceedings of the IEEE International Conference on Performance, Computing, and Communications, pp. 319-328, April 2001.
-   [20] M. K. Farrens and A. R. Pleszkun, “Strategies for Achieving Improved Processor Throughput,” Proceedings of the Annual International Symposium on Computer Architecture, pp. 362-369, May 1991.
-   [21] http://www.m5sim.org/
-   [22] SPEC CPU2006 benchmark suite. http://www.spec.org/cpu2006/

Multithreading processors have been used to improve performance in a single core for the past two decades. One low power and low complexity multithreading technique is Switch on Event multithreading (SoE MT, also known as coarse grain multithreading and block multithreading) [1], [2], [3], [20], where a thread runs inside the pipeline until an event occurs (e.g., a long latency event like a cache miss) and triggers a thread switch. The state of the replaced thread is maintained by the processor, while the long latency event is handled in the background. While a thread is switched, the in-flight instructions are flushed. The time required to refill the pipeline after a thread switch is referred to as the switch penalty. The switch penalty is usually relatively high, which makes SoE MT less popular than simultaneous multithreading (SMT) [18] and fine-grain multithreading (interleaved multithreading) [4]. While fine-grain MT is worthwhile only for a large number of threads, the performance of SMT is limited in practice due to limitations on the number of supported threads (e.g., two for Intel Sandy Bridge [5]).

SUMMARY OF THE INVENTION

According to an embodiment of the invention various methods may be provided and are described in the specification. Additional embodiments of the invention include a device that may be arranged to execute any or all of the methods described in the specification, including any stages and any combinations of same.

According to an embodiment of the invention there may be provided a device. The device may include (a) a set of multiple pipeline stages, wherein the set of multiple pipeline stages is arranged to execute a first thread of instructions; (b) multiple memristor based registers that are arranged to store a state of another thread of instructions that differs from the first thread of instructions; and (c) a control circuit that is arranged to control a thread switch between the first thread of instructions and the other thread of instructions by controlling (i) a storage of a state of the first thread of instructions at the multiple memristor based registers and (ii) a provision of the state of the other thread of instructions by the set of multiple pipeline stages. The set of multiple pipeline stages may be arranged to execute the other thread of instructions upon a reception of the state of the other thread of instructions. The first thread of instructions may also be referred to as an active thread and the other threads may be referred to as inactive threads. When a thread switch occurs the first thread may become a previously active thread and one other thread may become the new active thread.

The memristor based registers may include any resistive memory elements, such as but not limited to spin torque transfer magnetoresistive memory elements, or may include other resistive memory elements.

The resistive memory elements may be formed in close proximity to the multiple pipeline stages.

The resistive memory elements may be positioned directly above portions of the set of multiple pipeline stages.

The duration of the thread switch may not exceed the period that it may take to refill the pipeline. For example, it may not exceed ten, five or three clock cycles of a clock signal provided to the set of multiple pipeline stages.

Each pipeline stage may be followed by a memristor based register.

The storage of the state of the first thread of instructions at the multiple memristor based registers may be preceded by extracting the state of the first thread of instructions from the set of multiple pipeline stages. The aggregate duration of the extracting of the state of the first thread of instructions and the storage of the state of the first thread of instructions may exceed the duration of the provision of the state of the other thread of instructions.

The multiple memristor based registers may be arranged to store a state of each one out of multiple (n) other threads of instructions that differ from the first thread of instructions; and the control circuit may be arranged to control thread switches between any threads of instructions out of the first thread of instructions and any one of the other threads of instructions.

The number (n) of other (inactive) threads may exceed 2, 3, 5, 7, 9, 10, 12, 20, 30 or even more.

The multiple memristor based registers may include multiple layers, wherein each layer is dedicated for storing the status of a single other thread of instructions. It is noted that the term layer may refer to any group of memory elements of the memristor based registers and that a single layer may store the status related to multiple threads.

The memristor based registers may include resistive memory elements; and each other thread of instructions may be stored in a memristive-based layer of the memristor based registers.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter regarded as the invention is particularly pointed out and distinctly claimed in the concluding portion of the specification. The invention, however, both as to organization and method of operation, together with objects, features, and advantages thereof, may best be understood by reference to the following detailed description when read with the accompanying drawings in which:

FIG. 1 illustrates a device according to an embodiment of the invention;

FIG. 2 illustrates a register of the device of FIG. 1 according to an embodiment of the invention;

FIG. 3 illustrates a portion of the device of FIG. 1 according to an embodiment of the invention;

FIG. 4A and FIG. 4B illustrate an execution of instructions according to an embodiment of the invention;

FIG. 5 illustrates a comparison between the performance of a device according to an embodiment of the invention and a prior art device;

FIG. 6 illustrates a portion of a device according to an embodiment of the invention;

FIG. 7 illustrates a device according to an embodiment of the invention; and

FIG. 8 illustrates a method according to an embodiment of the invention.

It will be appreciated that for simplicity and clarity of illustration, elements shown in the figures have not necessarily been drawn to scale. For example, the dimensions of some of the elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

DETAILED DESCRIPTION OF THE DRAWINGS

In the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the invention. However, it will be understood by those skilled in the art that the present invention may be practiced without these specific details. In other instances, well-known methods, procedures, and components have not been described in detail so as not to obscure the present invention.

There is provided Continuous Flow Multithreading (CFMT), a novel microarchitecture. The primary concept of CFMT is to support SoE MT for a large number of threads through the use of multistate pipeline registers (MPRs). These MPRs store the intermediate state of all instructions of inactive threads, eliminating the need to flush the pipeline on thread switches. This new machine is as simple as a regular SoE MT, and has higher energy efficiency while improving the performance as compared to regular SoE MT.

Haskins et al. extend SoE MT to differential multithreading (dMT) [19], proposing two threads running simultaneously in a single scalar pipeline for low cost microprocessors. CFMT takes a broader view of advanced SoE MT microarchitectures. CFMT extends SoE MT by enabling the use of numerous threads using multistate pipeline registers in deep pipeline machines. CFMT is applicable to any execution event that can cause a pipeline stall.

The development of new memory technologies, such as RRAM (Resistive RAM) [6] and STT-MRAM (Spin-Transfer Torque Magnetoresistive RAM) [7], enables MPRs since these devices are located in metal layers above the logic cells and are fast, dense, and power efficient. These memory technologies are referred to as memristors [8], [9].

Continuous Flow Multithreading (CFMT)

To reduce the thread switch penalty, a new thread switching mechanism for SoE MT is proposed. In CFMT, pipeline registers are replaced by MPRs, as shown in FIG. 1.

In FIG. 1, a set of memristor based registers such as multistate pipeline registers (MPRs) 40(1)-40(J) is located between every two pipeline stages 30(1)-30(J+1). For simplicity of explanation various pipeline registers that are placed after pipeline stages 30(J+1) and 30(K) are not shown.

Each MPR maintains a single bit (or multiple bits) of the state of an instruction from all threads. The number of MPRs corresponds to the number of bits required to store the entire state of an instruction in the specific pipeline stage.

For each pipeline stage, an MPR stores the state of the instructions from all threads. Thus, in the case of a thread switch (controlled by control unit 20), there is no need to flush all subsequent instructions. The processor 12 saves the state of each instruction from the switched thread in the relevant MPR in each pipeline stage, while handling the operation of the long latency instruction in the background. Instructions from the new active thread are inserted into the pipeline from the MPR, creating a continuous flow of instructions within the pipeline. When no thread switching is required, the pipeline operates as a regular pipeline and each MPR operates as a conventional pipeline register. FIG. 1 illustrates the MPRs as including four layers 41-44, one layer per thread (thread 4, thread 3, thread 2 and thread 1). These layers (41-44) may be physical layers or may differ from physical layers. The four layers allow the processor to execute one thread (the active thread) of instructions while storing the status of four other (inactive) threads of instructions. It is noted that the number of threads that can be stored by the processor 12 can differ from four, and may exceed four.

FIG. 1 also illustrates (see for example dashed arrow 51(2)) the storage of the state of a previously active thread of instructions to layer 43 of the MPRs and the retrieval of the state of the new active thread (see for example dashed arrow 52(2)) that was previously stored at layer 43 to the pipeline stages.

FIG. 7 differs from FIG. 1 by showing intermediate buffers 80(1)-80(J) that store the state of the previously active thread of instructions before that state is stored at the appropriate layer of the MPR.

It is noted that although FIGS. 1 and 7 show the status of the new active thread of instructions being retrieved from the same layer that is used to store the state of the previously active thread of instructions, this is not necessarily so, especially if at a given moment there are multiple vacant layers. Thus, the status of a new active thread of instructions may be retrieved from a layer that differs from the layer that is used to store the state of the previously active thread of instructions.

When the long latency instruction is completed, the result is written directly into the MPR in the background. In CFMT, the thread switch penalty is determined by the time required to change the active thread in the MPR, i.e., the time required to read the state of the new, previously inactive thread from the MPR. For a fast MPR, the thread switch penalty is significantly lower than in conventional SoE MT and the performance therefore increases significantly.

Multi-State Pipeline Register (MPR)

The logic structure of a multistate pipeline register (MPR) is shown in FIG. 2.

FIG. 2 illustrates the logic structure of a multistate pipeline register (MPR) 40(1) according to an embodiment of the invention. MPR 40(1) includes multiple layers 40(1, 1)-40(1, n) for storing the state of pipeline stage 30(1) for each thread out of up to n threads of instructions.

An MPR maintains a single bit (or multiple bits) of the state of an instruction from all threads (stores an integer multiple of n bits of data), where only one thread is active at a time. The MPR is synchronized by the processor clock (125) and can switch the active thread in response to a reception of a switching enable 122 trigger.

Each MPR stores data (status) for multiple threads, one or more bits per thread. The total size of an MPR is therefore an integer multiple of n bits, where n is the maximum number of threads. For each pipeline stage, the state of the thread of instructions is stored in a set of MPRs with common control signals for thread management and switching. The MPR has one active thread (the current thread) for which the data can be read and written during operation of the processor, as in a regular pipeline register. During a thread switch, the active thread changes (as indicated by active thread select signal 123) while the state of the previously active thread (thread select 121 indicates where to store the previously active thread) is maintained in the MPR. The data (status) is received via status in port 124 and is outputted via status out port 126. The MPR can therefore store data for all threads running in the machine. The time required to change the active thread in the MPR depends on the specific circuit structure of the MPR. This time determines the thread switch penalty of CFMT. A typical thread switch penalty in CFMT is in the range of 1 to 3 clock cycles (or may be even more), while being smaller than the penalty associated with SoE MT (typically 8 to 15 clock cycles).
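The behavior of a single MPR described above can be sketched as a small software model. This is only an illustrative behavioral sketch, not the circuit of FIG. 2: the class name, the method names, the `width` parameter and the use of `None` to mark a vacant layer are assumptions introduced here. The arguments `switch_enable`, `thread_select` and `active_thread_select` loosely mirror signals 122, 121 and 123, and the return value of `clock` plays the role of the status out port.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MultistatePipelineRegister:
    """Behavioral model of one MPR: an active register plus n storage layers."""
    n_layers: int
    width: int = 1                      # bits of state per thread (assumed)
    active: int = 0                     # state of the currently active thread
    layers: List[Optional[int]] = field(default_factory=list)

    def __post_init__(self) -> None:
        # None marks a vacant layer (no inactive thread stored there).
        self.layers = [None] * self.n_layers

    def _mask(self, value: int) -> int:
        # Truncate a value to the register width.
        return value & ((1 << self.width) - 1)

    def store(self, layer: int, value: int) -> None:
        """Preload the state of an inactive thread into a layer."""
        self.layers[layer] = self._mask(value)

    def clock(self, status_in: int, switch_enable: bool = False,
              thread_select: int = 0, active_thread_select: int = 0) -> int:
        """Model one clock edge and return the value on the status-out port."""
        if not switch_enable:
            # No switch: behaves as a conventional pipeline register.
            self.active = self._mask(status_in)
            return self.active
        incoming = self.layers[active_thread_select]
        if incoming is None:
            raise ValueError("selected layer holds no thread state")
        if (thread_select != active_thread_select
                and self.layers[thread_select] is not None):
            raise ValueError("target layer is already occupied")
        # Vacate the layer being read, then save the previously active state.
        # Reading first allows the same layer to be reused for the save.
        self.layers[active_thread_select] = None
        self.layers[thread_select] = self.active
        self.active = incoming
        return self.active
```

In this sketch the input on `status_in` is ignored during a switch, since the register content is taken from the selected layer instead; a real design may handle that cycle differently.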

Emerging Memory Technologies

Over the past decade, new technologies have been considered as potential replacements for the traditional SRAM/DRAM-based memory system to overcome scaling issues, such as greater leakage current. These emerging technologies include PCM (Phase Change Memory) [10], PMC (Programmable Metallization Cell, also known as CBRAM) [11], FeRAM (Ferroelectric RAM) [12], RRAM (Resistive RAM) [9], and STT-MRAM (Spin Transfer Torque Magnetoresistive RAM) [13].

While the physical mechanism for these emerging memory technologies is different, all of these technologies are nonvolatile with varying resistance and can therefore be considered as memristors [8]. These emerging memory technologies are fabricated by introducing a special insulator layer between two layers of metal which can be integrated into a CMOS process, stacked vertically in multilayer metal structures physically above the active silicon transistors. This fabrication technique provides a high density of memory bits above a small area of active silicon. Memristive memory cell sizes are approximately 1 to 4 F² for RRAM and 8 to 45 F² for STT-MRAM, as compared to SRAM (60 to 175 F²) and DRAM (4 to 15 F²) [14], where F is the minimum feature size in the technology.

RRAM and STT-MRAM are both relatively fast [15]. STT-MRAM does not exhibit any endurance issues, while it is believed that the endurance issue of RRAM will be overcome in the near future [16]. Since memristors are dense, fast, and power efficient, these devices are attractive for use within the processor as an MPR.

FIG. 3 shows multiple layers of memristors 40(1, 1)-40(1, n) that belong to MPR 40(1). Each thread (out of up to n threads) has its own memristor-based layer out of layers 40(1, 1)-40(1, n), where every bit is stored in a single memristor. The active thread is executed by pipeline stage 30(1) and its state may be stored in flip-flops 31(1) of a CMOS layer.

During regular operation of the pipeline, only the CMOS layer is active (blue line) and all memristor-based layers are disabled, exploiting the non-volatility of the memristors to save power.

During a thread switch, the data from the CMOS layer is retrieved (dashed arrow 201) from pipeline stage 30(1) and, after the layer that is to store the status of the previously active thread is selected in response to thread select signal 121 (box 210), is written (dashed line 202) into the selected memristor-based layer, while the state of the new active thread (dashed line 202) is sensed and read (220) and transferred (dashed line 204) to the next pipeline stage 30(2).

For a memristor-based MPR, each thread has its own memristor-based layer, while the bottom CMOS layer is used for the active thread running within the pipeline. The bottom layer consists of standard CMOS pipeline registers, compatible with CMOS logic. During a thread switch, data is copied from the CMOS layer to a specific memristor-based layer that corresponds to the previously active thread. The data from the new active thread is read into the next pipeline stage that receives the state of the new thread. When no thread switch occurs, only the bottom CMOS layer is active and the memristor layers are in standby mode. It is possible to completely disable the memristor layers and save power due to the nonvolatility of memristors.

FIG. 6 illustrates a portion 11 of device 10 according to an embodiment of the invention. The portion 11 includes the flip-flops 31(1) of pipeline stage 30(1) and layers 40(1, 1)-40(1, 4) of MPR 40(1). Layers 40(1, 1)-40(1, 4) are implemented in four metal layers 71-74 while the flip-flops are implemented in a silicon layer 70-1. FIG. 6 shows that layers 40(1, 1)-40(1, 4) are positioned directly above flip-flops 31(1), and thus the distance between these components is very short, contributing to the very fast retrieval and fetching of status between flip-flops 31(1) and the relevant layer.

FIG. 6 also shows that a memristor of memristor layer 40(1, 4) can include two metal layer conductors 90 and 91 and a memristor interface 92 that connects these two conductors.

To determine the thread switch penalty for a memristor-based MPR, only sensing the memristor layer of the new active thread is considered since the copy operation of the bottom CMOS layer to a memristor layer can be masked using buffers. This latency is determined by the read time of a memristor (sensing the data in the memristive layer). Due to the high density of memristors, our preliminary design of the memristor-based MPR shows that the area overhead can be neglected (less than 0.1% of the pipeline area for 16 active threads). This overhead is primarily due to the write mechanism and can be further optimized by separating the read and write mechanisms.

Performance Analysis

The performance (in CPI, cycles per instruction) of an SoE processor depends upon whether the number of threads is sufficient to overlap long latency events. Two regions of operation exist in SoE processors, depending upon the number of threads running in the machine. The unsaturated region is the region where the number of threads is fewer than the number required for concealing a long latency event. The behavior of the pipeline in this region is illustrated in FIG. 4A. The analytic model assumes that the execution behavior in the pipeline is periodic. The period is determined by the execution of 1/r_(m) instructions (for example five instructions) from the same thread, where r_(m) is the average fraction of memory operations in the instruction stream. One instruction is a long latency instruction (i.e., the instruction that triggers the thread switch; here, an L1 cache miss is assumed as the trigger, with a miss penalty of P_(m) cycles) and the remaining instructions are low latency instructions with an average CPI of CPI_(ideal). During execution of the long latency instruction, other instructions from different threads run within the machine. For these instructions, a periodic behavior is again assumed which also triggers a thread switch. For the unsaturated region, it is assumed that there is an insufficient number of instructions to overlap the P_(m) cycles required to execute the long latency instruction. The CPI in the unsaturated region is

$\begin{matrix}{{{CPI}_{unsat} = \frac{{CPI}_{ideal} + {P_{m} \cdot r_{m} \cdot {{MR}(n)}}}{n}},} & (1)\end{matrix}$

where n is the number of threads running in the machine and MR(n) is the miss rate of the L1 cache. Note that CPI_(unsat) cannot fall below CPI_(sat), which is given in (2).

When a sufficient number of threads runs on the machine, the long latency instruction can be completely overlapped, and a second region, named the saturation region, is reached. In the saturation region, the thread switch penalty (P_(s) clock cycles) influences the behavior, which effectively limits the number of threads (above a specific number of threads there is no change in performance). The behavior of the pipeline in the saturation region is illustrated in FIG. 4B. Assume all of the threads exhibit the same average behavior and P_(m)>>CPI_(ideal)/r_(m) (i.e., the miss penalty is significantly longer than the execution time of the short latency instructions).

In both FIGS. 4A and 4B each box (310 in FIG. 4A and 320 in FIG. 4B) is an instruction. The numbers indicate the thread number. Thus the first five instructions belong to thread 1, the second five instructions belong to thread 2, the third five instructions belong to thread 3, the fourth five instructions belong to thread 4 and the fifth five instructions belong to thread 5.

The CPI in the saturation region is

$\begin{matrix}{{{CPI}_{sat} = {{CPI}_{ideal} + {P_{s} \cdot r_{m} \cdot {{MR}(n)}}}},} & (2)\end{matrix}$
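As a rough illustration, the two-region model of (1) and (2) can be evaluated numerically. The sketch below assumes a constant miss rate (MR(n) independent of n) and uses hypothetical function names; the crossover thread count is obtained here by equating (1) and (2), which is a step the text does not spell out.

```python
def cpi_unsaturated(cpi_ideal: float, p_m: float, r_m: float,
                    miss_rate: float, n: int) -> float:
    """Equation (1): CPI when too few threads hide the miss penalty."""
    return (cpi_ideal + p_m * r_m * miss_rate) / n


def cpi_saturated(cpi_ideal: float, p_s: float, r_m: float,
                  miss_rate: float) -> float:
    """Equation (2): CPI when the switch penalty dominates."""
    return cpi_ideal + p_s * r_m * miss_rate


def cpi(cpi_ideal: float, p_m: float, p_s: float, r_m: float,
        miss_rate: float, n: int) -> float:
    """Overall CPI: the unsaturated value, bounded below by the saturated one."""
    return max(cpi_unsaturated(cpi_ideal, p_m, r_m, miss_rate, n),
               cpi_saturated(cpi_ideal, p_s, r_m, miss_rate))


def saturation_threads(cpi_ideal: float, p_m: float, p_s: float,
                       r_m: float, miss_rate: float) -> float:
    """Thread count at which (1) equals (2), for a constant miss rate."""
    return ((cpi_ideal + p_m * r_m * miss_rate)
            / (cpi_ideal + p_s * r_m * miss_rate))


if __name__ == "__main__":
    # CPI_ideal = 1 is an assumption; the other values are example inputs.
    params = dict(cpi_ideal=1.0, p_m=200.0, p_s=20.0, r_m=0.25, miss_rate=0.25)
    for n in (1, 2, 4, 8, 16):
        print(n, cpi(n=n, **params))
    print("crossover:", saturation_threads(**params))
```

With these example inputs and an assumed CPI_ideal of 1, the crossover lies at 6 threads; a smaller switch penalty moves the crossover to a higher thread count, so more threads remain useful.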

In a conventional SoE MT, the switch penalty P_(s) is determined by the number of instructions flushed during each switch. In CFMT, however, the switch penalty is the MPR read time T_(m), i.e., the time required to read the state from the MPR and transfer this state to the next pipeline stage. In the case of a memristor-based MPR, the switch penalty is the time required to read the data from the memristor layer. From (2), if the value of T_(m) is lower than P_(s), the performance of the processor in the saturation region is significantly improved, where the speedup is

$\begin{matrix}{{Speedup}_{sat} = {1 + {\frac{r_{m} \cdot {{MR}(n)}}{{CPI}_{ideal} + {T_{m} \cdot r_{m} \cdot {{MR}(n)}}} \cdot {\left( {P_{s} - T_{m}} \right).}}}} & (3)\end{matrix}$
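For reference, (3) can be obtained from (2) as the ratio of the saturated CPI of a conventional SoE MT (switch penalty P_(s)) to the saturated CPI of CFMT (switch penalty T_(m)). The superscripts SoE and CFMT below are labels introduced here for illustration only.

$\begin{aligned}{Speedup}_{sat} &= \frac{{CPI}_{sat}^{SoE}}{{CPI}_{sat}^{CFMT}} = \frac{{CPI}_{ideal} + P_{s} \cdot r_{m} \cdot {MR}(n)}{{CPI}_{ideal} + T_{m} \cdot r_{m} \cdot {MR}(n)} \\ &= \frac{{CPI}_{ideal} + T_{m} \cdot r_{m} \cdot {MR}(n) + \left( P_{s} - T_{m} \right) \cdot r_{m} \cdot {MR}(n)}{{CPI}_{ideal} + T_{m} \cdot r_{m} \cdot {MR}(n)} \\ &= 1 + \frac{r_{m} \cdot {MR}(n)}{{CPI}_{ideal} + T_{m} \cdot r_{m} \cdot {MR}(n)} \cdot \left( P_{s} - T_{m} \right)\end{aligned}$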

Note that in the unsaturated region, the exact CPI of the CFMT is slightly better (lower) than that of a conventional SoE MT processor due to the improved switch penalty. The instructions per cycle (IPC, wherein IPC=1/CPI) of the proposed machine as compared to a conventional SoE machine is shown in FIG. 5.

FIG. 5 shows the IPC of the Continuous Flow MT (CFMT) (curves 410 and 420) as compared to a conventional SoE MT processor (curve 430). The memristor read time, which determines the thread switch penalty, is either three clock cycles or one clock cycle. The IPC of CFMT is approximately twice that of a conventional SoE MT (a 2× improvement) for T_(m)=1 cycle with a constant miss rate MR=0.25, r_(m)=0.25, P_(s)=20 cycles, and P_(m)=200 cycles.
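The quoted improvement can be checked against (2) and (3) with a short calculation. The sketch below assumes CPI_ideal = 1, which the text does not state, and uses hypothetical function names; under that assumption the saturation speedup evaluates to about 2.12 for T_m = 1 cycle and about 1.89 for T_m = 3 cycles.

```python
def cpi_sat(cpi_ideal: float, penalty: float, r_m: float,
            miss_rate: float) -> float:
    """Equation (2) with a generic switch penalty (P_s or T_m)."""
    return cpi_ideal + penalty * r_m * miss_rate


def speedup_sat(cpi_ideal: float, p_s: float, t_m: float, r_m: float,
                miss_rate: float) -> float:
    """Equation (3): saturation speedup of CFMT over conventional SoE MT."""
    return 1.0 + (r_m * miss_rate * (p_s - t_m)
                  / (cpi_ideal + t_m * r_m * miss_rate))


if __name__ == "__main__":
    CPI_IDEAL = 1.0            # assumed; not given in the text
    R_M, MR, P_S = 0.25, 0.25, 20.0
    print("SoE MT IPC:", 1.0 / cpi_sat(CPI_IDEAL, P_S, R_M, MR))
    for t_m in (1.0, 3.0):
        ipc = 1.0 / cpi_sat(CPI_IDEAL, t_m, R_M, MR)
        print(f"T_m={t_m}: IPC={ipc:.4f}, "
              f"speedup={speedup_sat(CPI_IDEAL, P_S, t_m, R_M, MR):.4f}")
```

The miss penalty P_m does not appear because it drops out of the saturated-region expressions.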

The performance of the proposed machine exhibits a 2× performance improvement for a constant miss rate when operating in the saturation region. For varying miss rates (particularly with large P_(m)), the behavior of the CPI is similar to the behavior reported in [17]. Preliminary simulations have been performed on GEMS [21], exhibiting a saturation performance improvement of approximately 50% for the SPEC MCF benchmark [22].

Because the illustrated embodiments of the present invention may, for the most part, be implemented using electronic components and circuits known to those skilled in the art, details will not be explained to any greater extent than that considered necessary, as illustrated above, for the understanding and appreciation of the underlying concepts of the present invention and in order not to obfuscate or distract from the teachings of the present invention.

Any reference in the specification to a method should be applied mutatis mutandis to a system capable of executing the method and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that once executed by a computer result in the execution of the method.

Any reference in the specification to a system should be applied mutatis mutandis to a method that may be executed by the system and should be applied mutatis mutandis to a non-transitory computer readable medium that stores instructions that may be executed by the system.

Any reference in the specification to a non-transitory computer readable medium should be applied mutatis mutandis to a system capable of executing the instructions stored in the non-transitory computer readable medium and should be applied mutatis mutandis to a method that may be executed by a computer that reads the instructions stored in the non-transitory computer readable medium.

FIG. 8 illustrates method 500 according to an embodiment of the invention.

Method 500 may start by stages 510 and 520.

Stage 510 may include executing, by a set of multiple pipeline stages, a first thread of instructions.

Stage 520 may include storing, by multiple memristor based registers, a state of another thread of instructions that differs from the first thread of instructions.

Stage 520 may include:

-   a. Storing, by the multiple memristor based registers, a state of each one out of multiple other threads of instructions that differ from the first thread of instructions.
-   b. Storing, by each layer, a status of a single other thread of instructions.
-   c. Storing each other thread of instructions in a memristive-based layer of the memristor based registers.

Stages 510 and 520 may be followed by stage 530 of executing a thread switch between the first thread of instructions and the other thread of instructions. The executing of the thread switch may include storing a state of the first thread of instructions at the multiple memristor based registers; and providing the state of the other thread of instructions by the set of multiple pipeline stages. The state of the other thread of instructions facilitates an executing of the other thread of instructions. The memristor based registers may include spin torque transfer magnetoresistive memory elements or resistive memory elements.

Stage 530 is followed by stages 510 and 520, wherein the other thread of instructions (the state of which was fed to the pipeline stages) becomes the first thread (or active thread) of instructions and the previously active thread (previously first thread) becomes an inactive thread (another thread).

Stage 530 may include extracting the state of the first thread of instructions from the set of multiple pipeline stages. The aggregate duration of the extracting of the state of the first thread of instructions and the storing of the state of the first thread of instructions may exceed a duration of the provision of the state of the other thread of instructions.

Stage 530 may include executing a thread switch between any thread of instructions out of the first thread of instructions and the multiple other threads of instructions.

Stage 910 may be followed by stage

In the foregoing specification, the invention has been described with reference to specific examples of embodiments of the invention. It will, however, be evident that various modifications and changes may be made therein without departing from the broader spirit and scope of the invention as set forth in the appended claims.

Moreover, the terms “front,” “back,” “top,” “bottom,” “over,” “under” and the like in the description and in the claims, if any, are used for descriptive purposes and not necessarily for describing permanent relative positions. It is understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the invention described herein are, for example, capable of operation in other orientations than those illustrated or otherwise described herein.

The connections as discussed herein may be any type of connection suitable to transfer signals from or to the respective nodes, units or devices, for example via intermediate devices. Accordingly, unless implied or stated otherwise, the connections may for example be direct connections or indirect connections. The connections may be illustrated or described in reference to being a single connection, a plurality of connections, unidirectional connections, or bidirectional connections. However, different embodiments may vary the implementation of the connections. For example, separate unidirectional connections may be used rather than bidirectional connections and vice versa. Also, a plurality of connections may be replaced with a single connection that transfers multiple signals serially or in a time multiplexed manner. Likewise, single connections carrying multiple signals may be separated out into various different connections carrying subsets of these signals. Therefore, many options exist for transferring signals.

Although specific conductivity types or polarity of potentials have been described in the examples, it will be appreciated that conductivity types and polarities of potentials may be reversed.

Each signal described herein may be designed as positive or negative logic. In the case of a negative logic signal, the signal is active low where the logically true state corresponds to a logic level zero. In the case of a positive logic signal, the signal is active high where the logically true state corresponds to a logic level one. Note that any of the signals described herein may be designed as either negative or positive logic signals. Therefore, in alternate embodiments, those signals described as positive logic signals may be implemented as negative logic signals, and those signals described as negative logic signals may be implemented as positive logic signals.

Furthermore, the terms “assert” or “set” and “negate” (or “deassert” or “clear”) are used herein when referring to the rendering of a signal, status bit, or similar apparatus into its logically true or logically false state, respectively. If the logically true state is a logic level one, the logically false state is a logic level zero. And if the logically true state is a logic level zero, the logically false state is a logic level one.
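The relationship between asserting, negating and signal polarity may be illustrated, purely as an example, by the following sketch. The class `Signal` and its members `active_high`, `level`, `assert_`, `negate` and `is_true` are hypothetical names chosen for the illustration and are not part of the described device.

```python
from dataclasses import dataclass, field


@dataclass
class Signal:
    """Hypothetical signal model with a configurable polarity."""
    active_high: bool = True          # True: positive logic, False: negative
    level: int = field(init=False)    # physical logic level, 0 or 1

    def __post_init__(self) -> None:
        # Start in the logically false state for either polarity.
        self.negate()

    def assert_(self) -> None:
        """Render the signal into its logically true state."""
        self.level = 1 if self.active_high else 0

    def negate(self) -> None:
        """Render the signal into its logically false state."""
        self.level = 0 if self.active_high else 1

    @property
    def is_true(self) -> bool:
        """Report whether the signal is in its logically true state."""
        return self.level == (1 if self.active_high else 0)
```

Under these assumptions, asserting an active low signal drives its level to zero, whereas asserting an active high signal drives its level to one; in both cases the signal is logically true.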

Those skilled in the art will recognize that the boundaries between logic blocks are merely illustrative and that alternative embodiments may merge logic blocks or circuit elements or impose an alternate decomposition of functionality upon various logic blocks or circuit elements. Thus, it is to be understood that the architectures depicted herein are merely exemplary, and that in fact many other architectures may be implemented which achieve the same functionality.

Any arrangement of components to achieve the same functionality is effectively “associated” such that the desired functionality is achieved. Hence, any two components herein combined to achieve a particular functionality may be seen as “associated with” each other such that the desired functionality is achieved, irrespective of architectures or intermedial components. Likewise, any two components so associated can also be viewed as being “operably connected,” or “operably coupled,” to each other to achieve the desired functionality.

Furthermore, those skilled in the art will recognize that boundaries between the above described operations are merely illustrative. The multiple operations may be combined into a single operation, a single operation may be distributed in additional operations and operations may be executed at least partially overlapping in time. Moreover, alternative embodiments may include multiple instances of a particular operation, and the order of operations may be altered in various other embodiments.

Also for example, in one embodiment, the illustrated examples may be implemented as circuitry located on a single integrated circuit or within a same device. Alternatively, the examples may be implemented as any number of separate integrated circuits or separate devices interconnected with each other in a suitable manner.

Also for example, the examples, or portions thereof, may be implemented as soft or code representations of physical circuitry or of logical representations convertible into physical circuitry, such as in a hardware description language of any appropriate type.

Also, the invention is not limited to physical devices or units implemented in non-programmable hardware but can also be applied in programmable devices or units able to perform the desired device functions by operating in accordance with suitable program code, such as mainframes, minicomputers, servers, workstations, personal computers, notepads, personal digital assistants, electronic games, automotive and other embedded systems, cell phones and various other wireless devices, commonly denoted in this application as ‘computer systems’.

However, other modifications, variations and alternatives are also possible. The specifications and drawings are, accordingly, to be regarded in an illustrative rather than in a restrictive sense.

In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word ‘comprising’ does not exclude the presence of other elements or steps than those listed in a claim. Furthermore, the terms “a” or “an,” as used herein, are defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” in the claims should not be construed to imply that the introduction of another claim element by the indefinite articles “a” or “an” limits any particular claim containing such introduced claim element to inventions containing only one such element, even when the same claim includes the introductory phrases “one or more” or “at least one” and indefinite articles such as “a” or “an.” The same holds true for the use of definite articles. Unless stated otherwise, terms such as “first” and “second” are used to arbitrarily distinguish between the elements such terms describe. Thus, these terms are not necessarily intended to indicate temporal or other prioritization of such elements. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage.

While certain features of the invention have been illustrated and described herein, many modifications, substitutions, changes, and equivalents will now occur to those of ordinary skill in the art. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the true spirit of the invention.

We claim:
 1. A device that comprises: a set of multiple pipeline stages, wherein the set of multiple pipeline stages is arranged to execute a first thread of instructions; multiple memristor based registers that are arranged to store a state of another thread of instructions that differs from the first thread of instructions; and a control circuit that is arranged to control a thread switch between the first thread of instructions and the other thread of instructions by controlling a storage of a state of the first thread of instructions at the multiple memristor based registers and by controlling a provision of the state of the other thread of instructions by the set of multiple pipeline stages; wherein the set of multiple pipeline stages is arranged to execute the other thread of instructions upon a reception of the state of the other thread of instructions.
 2. The device according to claim 1, wherein the memristor based registers comprise spin torque transfer magnetoresistive memory elements.
 3. The device according to claim 1, wherein the memristor based registers comprise resistive memory elements.
 4. The device according to claim 3, wherein the resistive memory elements are formed in close proximity to the multiple pipeline stages.
 5. The device according to claim 3, wherein the resistive memory elements are positioned directly above portions of the set of multiple pipeline stages.
 6. The device according to claim 1, wherein a duration of the thread switch does not exceed a time required to re-fill the pipeline.
 7. The device according to claim 1, wherein each pipeline stage is followed by a memristor based register.
 8. The device according to claim 1, wherein the storage of the state of the first thread of instructions at the multiple memristor based registers is preceded by extracting the state of the first thread of instructions from the set of multiple pipeline stages.
 9. The device according to claim 8, wherein an aggregate duration of the extracting of the state of the first thread of instructions and the storage of the state of the first thread of instructions exceeds a duration of the provision of the state of the other thread of instructions.
 10. The device according to claim 1, wherein the multiple memristor based registers are arranged to store a state of each one out of multiple other threads of instructions that differ from the first thread of instructions; and wherein the control circuit is arranged to control thread switches between any instructions out of the first thread of instructions and any one of the other threads of instructions.
 11. The device according to claim 10, wherein the multiple other threads of instructions exceed three threads of instructions.
 12. The device according to claim 10, wherein the multiple other threads of instructions exceed ten threads of instructions.
 13. The device according to claim 10, wherein the multiple memristor based registers comprise multiple layers, wherein each layer is dedicated for storing the status of a single other thread of instructions.
 14. The device according to claim 10, wherein the memristor based registers comprise resistive memory elements; wherein each other thread of instructions is stored in a memristive-based layer of the memristor based registers.
 15. A method, comprising: executing, by a set of multiple pipeline stages, a first thread of instructions; storing, by multiple memristor based registers, a state of another thread of instructions that differs from the first thread of instructions; and executing a thread switch between the first thread of instructions and the other thread of instructions; wherein the executing of the thread switch comprises: storing a state of the first thread of instructions at the multiple memristor based registers; and providing the state of the other thread of instructions by the set of multiple pipeline stages; wherein the state of the other thread of instructions facilitates an executing of the other thread of instructions.
 16. The method according to claim 15, wherein the memristor based registers comprise spin torque transfer magnetoresistive memory elements.
 17. The method according to claim 15, wherein the memristor based registers comprise resistive memory elements.
 18. The method according to claim 17, wherein the resistive memory elements are formed in close proximity to the multiple pipeline stages.
 19. The method according to claim 17, wherein the resistive memory elements are positioned directly above portions of the set of multiple pipeline stages.
 20. The method according to claim 15, wherein a duration of the thread switch does not exceed 3 clock cycles of a clock signal provided to the set of multiple pipeline stages.
 21. The method according to claim 15, wherein each pipeline stage is followed by a memristor based register.
 22. The method according to claim 15, wherein the storing of the state of the first thread of instructions at the multiple memristor based registers is preceded by extracting the state of the first thread of instructions from the set of multiple pipeline stages.
 23. The method according to claim 8, wherein an aggregate duration of the extracting of the state of the first thread of instructions and the storing of the state of the first thread of instructions exceeds a duration of the provision of the state of the other thread of instructions.
 24. The method according to claim 15, comprising storing by the multiple memristor based registers a state of each one out of multiple other threads of instructions that differ from the first thread of instructions; and wherein the method comprises executing a thread switch between any thread of instructions out of the first thread of instructions and the multiple other threads of instructions.
 25. The method according to claim 24, wherein the multiple other threads of instructions exceed three threads of instructions.
 26. The method according to claim 24, wherein the multiple other threads of instructions exceed ten threads of instructions.
 27. The method according to claim 24, wherein the multiple memristor based registers comprise multiple layers, wherein the storing comprises storing by each layer a status of a single other thread of instructions.
 28. The method according to claim 24, wherein the memristor based registers comprise resistive memory elements; wherein the method comprises storing each other thread of instructions in a memristive-based layer of the memristor based registers.